Let's say we have $\hat{p} = X/N = 100/1000 = 0.1$, where $X$ = # of users who clicked and $N$ = # of users.
A rule of thumb for normality is to check $N \cdot \hat{p} > 5$ (and likewise $N \cdot (1-\hat{p}) > 5$); otherwise use the t-distribution instead of the z-distribution.
The margin of error is $m = z_{\alpha/2} \cdot SE = z_{\alpha/2} \cdot \sqrt{\hat{p}(1-\hat{p})/N}$. Notice that here $SE = \sqrt{p(1-p)/N}$ rather than $\sqrt{Np(1-p)}$ as for the binomial count, since we work with the fraction (proportion) of successes instead of the total number of successes.
For $\alpha = 5\%$, we have $m = z_{0.025} \cdot \sqrt{0.1 \times 0.9 / 1000} \approx 0.019$, and the final 95% CI is $[0.081, 0.119]$.
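This calculation can be sketched in Python (using scipy for the z critical value; the inputs are the numbers from this example):

```python
from math import sqrt
from scipy.stats import norm

def proportion_ci(x, n, alpha=0.05):
    """Normal-approximation CI for a proportion; assumes n * p_hat > 5."""
    p_hat = x / n
    se = sqrt(p_hat * (1 - p_hat) / n)   # SE of the proportion, not the count
    m = norm.ppf(1 - alpha / 2) * se     # margin of error; z_{0.025} ~ 1.96
    return p_hat - m, p_hat + m

lo, hi = proportion_ci(100, 1000)        # ~ (0.081, 0.119)
```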
1.17 Null and Alternative Hypothesis, Two-tailed vs. One-tailed tests
The null hypothesis and alternative hypothesis proposed here correspond to a two-tailed test, which allows you to distinguish between three cases:
A statistically significant positive result
A statistically significant negative result
No statistically significant difference.
Sometimes when people run A/B tests, they will use a one-tailed test, which only allows you to distinguish between two cases:
A statistically significant positive result
No statistically significant result
Which one you should use depends on what action you will take based on the results.
If you're going to launch the experiment for a statistically significant positive change, and otherwise not, then you don't need to distinguish between a negative result and no result, so a one-tailed test is good enough. If you want to learn the direction of the difference, then a two-tailed test is necessary.
1.19 Pooled Standard Error
We have $X_{cont}, X_{exp}, N_{cont}, N_{exp}$, and
Pooled sample mean $\hat{p}_{pool} = \frac{X_{cont} + X_{exp}}{N_{cont} + N_{exp}}$
Pooled sample standard error $SE_{pool} = \sqrt{\hat{p}_{pool}(1-\hat{p}_{pool})\left(\frac{1}{N_{cont}} + \frac{1}{N_{exp}}\right)}$
Test statistic $\hat{d} = \hat{p}_{exp} - \hat{p}_{cont}$
Null hypothesis $H_0: d = 0$, under which $\hat{d} \sim N(0, SE_{pool})$
For a 95% confidence level ($z_{1-0.05/2} = 1.96$), if $\hat{d} > 1.96 \times SE_{pool}$ or $\hat{d} < -1.96 \times SE_{pool}$, reject the null.
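A minimal sketch of the whole test (the click/user counts here are hypothetical, just to exercise the function):

```python
from math import sqrt

def pooled_z_test(x_cont, n_cont, x_exp, n_exp, z_crit=1.96):
    """Two-proportion z-test with a pooled standard error."""
    p_pool = (x_cont + x_exp) / (n_cont + n_exp)
    se_pool = sqrt(p_pool * (1 - p_pool) * (1 / n_cont + 1 / n_exp))
    d_hat = x_exp / n_exp - x_cont / n_cont
    return d_hat, se_pool, abs(d_hat) > z_crit * se_pool

# Hypothetical counts: 100/1000 clicks in control, 130/1000 in experiment.
d_hat, se_pool, reject = pooled_z_test(100, 1000, 130, 1000)
```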
1.21 - 24. Sample Size and Power
Two types of error
α=P(reject null | null True)
β=P(not reject null | null False)
So if the sample is small, α is at its chosen (low) level but β is high, i.e., it is harder to detect the alternative when a difference actually exists. On the other hand, if the sample is large, α is unchanged but β is much lower, as shown below.
(Figures: distributions compared at sample size = 1000 vs. sample size = 5000.)
$1-\beta$ is called sensitivity and is often chosen to be > 80%.
Note on power
Statistical textbooks often define power as the sensitivity. However, conversationally power often means the probability that your test draws the correct conclusions, which depends on both α and β.
The required sample size to achieve a certain statistical power can be calculated with an online calculator, in which you specify α, β, the baseline conversion rate (null), and the minimum detectable effect (alternative).
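Under the hood, such calculators typically use a normal-approximation formula along these lines (per-group sample size for comparing two proportions; the exact formula varies by calculator, so treat the output as approximate):

```python
from math import ceil, sqrt
from scipy.stats import norm

def required_n(p_base, d_min, alpha=0.05, beta=0.2):
    """Approximate per-group sample size to detect an absolute lift d_min."""
    p_alt = p_base + d_min
    z_a = norm.ppf(1 - alpha / 2)        # two-tailed Type I level
    z_b = norm.ppf(1 - beta)             # power = 1 - beta
    p_bar = (p_base + p_alt) / 2
    num = (z_a * sqrt(2 * p_bar * (1 - p_bar))
           + z_b * sqrt(p_base * (1 - p_base) + p_alt * (1 - p_alt))) ** 2
    return ceil(num / d_min ** 2)

n = required_n(0.1, 0.02)                # baseline 10%, detect +2 points
```

Note how demanding more power (lower β) or a smaller $d_{min}$ drives the required sample size up.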
Final notes on how the Type I & II error levels and the minimum detectable difference dmin jointly determine the required sample size are as follows
Examples of factors that affect the required sample size are as follows:
1.25 Pooled Example
A pooled example is shown below; notice how dmin works (we need the lower bound of the 1−α level CI to be >dmin=0.02)
1.26 Confidence Interval Case Breakdown
Shown below is how we should make the decision under various CI and dmin cases
Lesson 2: Policy and Ethics for Experiments
2.1 - 2.7. Four Principles
The IRB's four main principles to consider when conducting experiments are:
Risk: what risk is the participant undertaking? The main threshold is whether the risk exceeds that of "minimal risk", defined as the probability and magnitude of harm that a participant would encounter in normal daily life.
Benefit: what benefits might result from the study?
Choice/Alternatives: what other choices do participants have?
Privacy/Data Sensitivity: what data is being collected, and what is the expectation of privacy and confidentiality?
How sensitive is the data?
What is the re-identification risk of individuals from the data?
2.8 Assessing Data Sensitivity
An example of data sensitivity assessment is shown below
2.10 Summary of Principles
It's a grey area whether internet studies should be subject to IRB review or not and whether informed consent is required.
Most studies face the bigger question about data collection with regards to identifiability, privacy, and confidentiality / security.
Are participants facing more than minimal risk?
Do participants understand what data is being gathered?
Is that data identifiable?
How is the data handled?
Lesson 3: Choosing and Characterizing Metrics
3.2 - 3.3 Metric Definition Overview
Invariant Checking: metrics shouldn't change across experiment and control
Evaluation: what do we want to use the metrics for?
At the evaluation stage, it's better to settle on one single objective that multiple departments within the company would most likely agree on.
If multiple metrics are available or equally important, we can create a composite metric, e.g., an objective function or OEC (Overall Evaluation Criterion, a term coined by Microsoft).
A composite metric is less preferred, since it is better to settle on a less optimal metric that works across a suite of A/B tests than a perfect metric that works for only a single test.
3.5 Refining the Customer Funnel
An example of defining metrics for Udacity
Refining the customer funnel
High-level metrics
3.6 - 3.7 Quizzes on Choosing Metrics
How to choose metrics for different tests
Difficult metrics
Don't have access to data, e.g.,
Amazon wants to measure average happiness of shoppers
Google wants to measure probability of user finding information via search
Takes too long to measure, e.g.,
Udacity measures the rate at which customers who completed the 1st course return for the 2nd one.
3.8 Other techniques for defining metrics
External data
User experience research, surveys, focus groups
Retrospective analysis helps detect correlations for us to develop theories.
3.10 - 11 Techniques to Gather Additional Data and Examples
Techniques for gathering additional data
Udacity example
Examples where data is hard to get
3.13 Metric Definition: Click Through Example
Metric definition
3.16 - 3.17 Summary Metrics
Categories of summary metrics
Sums and counts.
e.g., # users who visited page
Means, medians, and percentiles
e.g., mean age of users who completed a course or
median latency of page load
Probabilities and rates
Probability has 0 or 1 outcome in each case
Rate has 0 or more
Ratios
e.g., $\frac{P(\text{revenue-generating click})}{P(\text{any click})}$
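The probability/rate distinction can be made concrete with a toy example (the click counts are hypothetical):

```python
# Each entry: number of clicks on one pageview (0, 1, or more).
clicks_per_pageview = [0, 2, 1, 0, 0, 3, 1, 0]

# Rate: total clicks / total pageviews -- each case contributes 0 or more.
ctr = sum(clicks_per_pageview) / len(clicks_per_pageview)                 # 7/8
# Probability: pageviews with >= 1 click / total pageviews -- each case is 0 or 1.
ctp = sum(c > 0 for c in clicks_per_pageview) / len(clicks_per_pageview)  # 4/8
```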
3.18 - 3.19 Sensitivity and Robustness
We want summary metrics to be sensitive to changes we care about and robust to changes we don't.
Example: choose summary metric for latency of a video
Use retrospective analysis to check robustness. For example, if we plot the distributions for similar videos and find the 95th and 99th percentiles of load time have noticeable variation between videos, those two metrics may not be robust enough.
We can also look at experimental data. For example, if we plot the distribution of load time for videos of increasing resolution, and find that the median and 80th percentile are not affected by resolution while only the 85th/90th/95th percentiles increase, then the median and 80th percentile may not be sensitive enough.
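A sketch of how such a percentile check might look, using a simple nearest-rank percentile on hypothetical load times for two similar videos:

```python
def pct(data, p):
    """Nearest-rank p-th percentile -- a small, dependency-free sketch."""
    s = sorted(data)
    k = min(len(s) - 1, max(0, round(p / 100 * len(s)) - 1))
    return s[k]

# Hypothetical page-load times (seconds) for two comparable videos.
video_a = [0.9, 1.0, 1.1, 1.2, 1.3, 1.5, 1.7, 2.0, 2.6, 8.0]
video_b = [0.8, 1.0, 1.1, 1.2, 1.4, 1.5, 1.8, 2.1, 2.7, 3.2]

# Robustness check: the median barely differs between the two videos,
# while the tail percentile swings a lot -- so p95 may not be robust here.
med_gap = abs(pct(video_a, 50) - pct(video_b, 50))
p95_gap = abs(pct(video_a, 95) - pct(video_b, 95))
```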
3.20 Absolute Versus Relative Differences
Usually start with absolute differences when we don't know the metric well.
Using relative difference means we might be able to stick with the same significance boundary and not need to worry about seasonality factors (e.g., think about CTR for shopping websites)
Some summary metrics may be harder to analyze. E.g., the sampling distribution of the median could be non-normal if the data is non-normal (e.g., latency with the bimodal distribution shown below).
Example: calculate the 95% CI for a mean with N = [87029, 113407, 84843, 104994, 99327, 92052, 60684]
Estimate variance and calculate CI using pooled results
Directly estimate confidence interval from empirical distribution
We can also use the bootstrap to generate multiple samples/metrics to estimate the variability.
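A percentile-bootstrap sketch for, e.g., the median of the seven daily counts above:

```python
import random
import statistics

def bootstrap_ci(data, stat, n_boot=5000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for an arbitrary summary statistic."""
    rng = random.Random(seed)
    boots = sorted(stat(rng.choices(data, k=len(data))) for _ in range(n_boot))
    return boots[int(n_boot * alpha / 2)], boots[int(n_boot * (1 - alpha / 2))]

data = [87029, 113407, 84843, 104994, 99327, 92052, 60684]
lo, hi = bootstrap_ci(data, statistics.median)
```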
Lesson 4: Designing an Experiment
4.2 - 4.3 Unit of Diversion Overview
Unit of diversion is how we define what an individual subject is in the experiment.
Commonly used:
User id
Stable, unchanging
Personally identifiable
Anonymous id (cookie)
Changes when you switch browser or device
Users can clear cookies
Event
No consistent experience
use only for non-user-visible changes
Less common:
Device id
only available for mobile
tied to specific device
unchangeable by user
IP address
changes when location changes
Example
4.4 - 4.5 Consistency of Diversion
First principle of choosing unit of diversion is to make sure users have consistent experience.
If the customer wouldn't be likely to notice the change, we might want to start with an event-based experiment. If a learning effect is detected later, we can switch to a cookie-based experiment.
Example
4.6 - 4.7 Ethical Considerations
An example is as follows.
Notice that only the second case requires additional ethical review/consent from the user, because it might compromise the anonymity of cookie-based data.
4.8 - 4.9 Unit of Analysis vs. Diversion
Unit of analysis is basically whatever the denominator of your analysis is.
In an interleaved ranking experiment, suppose you have two ranking algorithms, X and Y. Algorithm X would show results X1,X2,…XN in that order, and algorithm Y would show Y1,Y2,…YN. An interleaved experiment would show some interleaving of those results, for example, X1,Y1,X2,Y2,… with duplicate results removed. One way to measure this would be by comparing the click-through-rate or -probability of the results from the two algorithms. For more detail, see Large-Scale Validation and Analysis of Interleaved Search Evaluation.
The rule of thumb is to think about what the worst possible impact would be if everything goes wrong.
4.23 Learning Effects
Change aversion vs. novelty effect
To measure learning effect, we need a stateful unit of diversion like a cookie or a user ID
Better to use a cohort, as opposed to the whole population, to measure the effect of dosage (e.g., how frequently a subject sees the change)
Risk vs. duration
A/A tests are useful both pre- and post-experiment.
Lesson 5: Analyzing Results
Outline of this section
Sanity Checks
Single Metric
Multiple Metrics
Gotchas
5.1 - 5.7 Sanity Checks (invariant metrics)
Check invariant metrics
Check population sizing metrics to make sure control and experiment groups are comparable
Check actual invariant metrics
Quizzes
population sizing metrics
invariant metrics
Checking invariants
5.8 - 5.9 Single Metric
What not to do if your results aren't significant
Carrie gave some ideas of what you can do if your results aren't significant but you were expecting they would be. One tempting idea is to run the experiment for a few more days and see if the extra data helps you get a significant result. However, this can lead to a much higher false positive rate than you expect! See the post (How Not To Run an A/B Test) for more details. Instead of running longer when you don't like the results, you should size your experiment in advance to ensure that you will have enough power the first time you look at your results.
Example
Notice that the confidence interval does not include the detectable difference dmin.
Test the probability of observing 7 successes out of 7 experiments with a success rate of 0.5. The two-tailed p-value is 0.0156 (< α = 0.05), which is the probability of observing 0 or 7 successes in 7 experiments. Based on this, it's highly unlikely that the positive change in CTR in the experiment group is due to chance, so we recommend launching.
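This sign test is just a two-sided binomial test, e.g. with scipy (`binomtest` requires scipy ≥ 1.7):

```python
from scipy.stats import binomtest

# 7 positive days out of 7; under H0 each day is positive with probability 0.5.
p7 = binomtest(7, n=7, p=0.5).pvalue   # = P(X=0) + P(X=7) = 2 * (1/2)**7
```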
Another example
Notice that the CI includes dmin, so we cannot recommend launching
9 successes out of 14 days; with a two-tailed p-value of 0.4240, we cannot reject the null hypothesis that the result is due to pure chance
Overall the effect size (lift in CTR) shows a significant result, but the sign test fails. Digging deeper into the day-by-day data, we further observe that the effect is significant on weekends but not on weekdays.
5.10 - 11. Simpson's Paradox
An example of Simpson's paradox in A/B test is when your results within new/experienced user groups are consistent, but the aggregated result in the total population shows the reverse.
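A tiny numeric illustration (all counts hypothetical): the experiment wins within each cohort, yet loses in aggregate, because the cohort mix differs sharply between the two groups.

```python
# (clicks, users) per cohort; control is mostly experienced users,
# experiment is mostly new users.
cont = {"new": (5, 100),   "experienced": (100, 1000)}
expt = {"new": (60, 1000), "experienced": (12, 100)}

def rate(clicks, users):
    return clicks / users

# Experiment is better within every cohort (6% > 5%, 12% > 10%)...
better_per_cohort = all(rate(*expt[c]) > rate(*cont[c]) for c in cont)

# ...yet worse overall: 72/1100 ~ 6.5% vs. 105/1100 ~ 9.5%.
overall_cont = rate(*map(sum, zip(*cont.values())))
overall_expt = rate(*map(sum, zip(*expt.values())))
```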
5.12 - 5.15. Multiple Metrics
Bonferroni correction guarantees the overall false-positive rate $\alpha_{overall}$ by controlling the individual false-positive rate at $\alpha_{individual} = \alpha_{overall}/m$, where $m$ is the number of metrics tested. However, it is often too conservative.
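For instance, with three metrics the correction itself is just a division; the last line shows why it is conservative even under independence:

```python
alpha_overall = 0.05
m = 3                                  # number of metrics tested
alpha_individual = alpha_overall / m   # test each metric at ~0.0167

# Under independence, the realized family-wise error rate is strictly
# below the target, so Bonferroni over-corrects.
fwer = 1 - (1 - alpha_individual) ** m
```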
A Udacity example where Bonferroni is too conservative. In the example below ($Z^*$ is the critical value corresponding to the confidence level, and $m$ here denotes the margin of error, not the number of tests), three metrics that showed significant differences individually would be rejected if we used Bonferroni to keep $\alpha_{overall}$ at the same level (i.e., 0.05), which is probably too conservative.
Practical recommendations to counter the conservatism of the Bonferroni correction include:
Rigorous answer: Use a more sophisticated method (see next)
In practice: Judgement call, possibly based on business strategy
Often need to go deeper to understand the user to find reasons for conflicting metrics or outcomes among cohorts.
5.18. Changes Over Time
Ramp up experiments gradually (in sample size and user groups). However, results may not be repeatable due to changes over time (e.g., seasonal effects).
We can keep a holdout/holdback group that doesn't get any changes to track seasonality effects.
Lesson 6: Final Project
As of March 1st, 2020, I have gone through the first five (video) lessons of the course, and will take an indefinite leave of absence before finishing the final project lesson. Some resources are listed below.